SHapley Additive exPlanations (SHAP) - Homework 2 - Karol Pustelnik

Shapley values quantify the contribution of each feature to a model's prediction. They generalize the concept of a player's marginal contribution to a coalition. In this homework, we will use the shap and dalex packages to compute Shapley values for different models trained on the Heart Attack dataset. We will compare the results of the two packages, discuss the differences, interpret the Shapley values, and discuss variable importance. Furthermore, we will compare the Shapley values calculated for two different models: XGBoost and logistic regression. We will investigate in detail the model predictions for selected observations and discuss their Shapley values.

Importing Packages & loading data

age sex cp trtbps chol fbs restecg thalachh exng oldpeak slp caa thall output
0 63 1 3 145 233 1 0 150 0 2.3 0 0 1 1
1 37 1 2 130 250 0 1 187 0 3.5 0 0 2 1
2 41 0 1 130 204 0 0 172 0 1.4 2 0 2 1
3 56 1 1 120 236 0 1 178 0 0.8 2 0 2 1
4 57 0 0 120 354 0 1 163 1 0.6 2 0 2 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
298 57 0 0 140 241 0 1 123 1 0.2 1 0 3 0
299 45 1 3 110 264 0 1 132 0 1.2 1 0 3 0
300 68 1 0 144 193 1 1 141 0 3.4 1 2 3 0
301 57 1 0 130 131 0 1 115 1 1.2 1 1 3 0
302 57 0 1 130 236 0 0 174 0 0.0 1 1 2 0

303 rows × 14 columns

Data analysis

The data consists of 13 features and 1 label column (output). The dimension of the data is (303, 14). The features are:

Age : Age of the patient - continuous

Sex : Sex of the patient (1 = male, 0 = female) - categorical

cp : chest pain type - categorical

Value 0: typical angina
Value 1: atypical angina
Value 2: non-anginal pain
Value 3: asymptomatic

trtbps : resting blood pressure (in mm Hg) - continuous

chol : cholesterol in mg/dl fetched via BMI sensor - continuous

fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) - categorical

restecg : resting electrocardiographic results - categorical

Value 0: normal
Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

thalachh : maximum heart rate achieved - continuous

exng: exercise induced angina (1 = yes; 0 = no) - categorical

oldpeak : ST depression induced by exercise relative to rest - continuous

slp : the slope of the peak exercise ST segment - categorical

Value 0: upsloping
Value 1: flat
Value 2: downsloping
Value 3: no information


caa: number of major vessels (0-4) - categorical

thall : thallium stress test result - categorical

Value 0: no information
Value 1: fixed defect
Value 2: normal
Value 3: reversible defect

output : 0 = no heart disease, 1 = heart disease - categorical

One hot encoding of categorical features

The categorical features are one-hot encoded because they are not ordinal. The encoding is done with the pandas get_dummies function.
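As a sketch, the encoding step can look as follows; the two example rows and variable names here are illustrative, not the notebook's actual code:

```python
import pandas as pd

# Two rows of the dataset shown above, as a stand-in for the full DataFrame.
df = pd.DataFrame({
    "age": [63, 41], "sex": [1, 0], "cp": [3, 1], "trtbps": [145, 130],
    "chol": [233, 204], "fbs": [1, 0], "restecg": [0, 0],
    "thalachh": [150, 172], "exng": [0, 0], "oldpeak": [2.3, 1.4],
    "slp": [0, 2], "caa": [0, 0], "thall": [1, 2], "output": [1, 1],
})

categorical = ["sex", "cp", "fbs", "restecg", "exng", "slp", "caa", "thall"]
encoded = pd.get_dummies(df, columns=categorical)

# Continuous columns (age, trtbps, chol, thalachh, oldpeak) and the label
# stay untouched; each categorical column becomes one 0/1 column per value.
print(encoded.columns.tolist())
```

Note that get_dummies only creates columns for the values present in the data, which is why the full dataset yields the 31 columns shown below.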

age trtbps chol thalachh oldpeak output sex_0 sex_1 cp_0 cp_1 ... slp_2 caa_0 caa_1 caa_2 caa_3 caa_4 thall_0 thall_1 thall_2 thall_3
0 63 145 233 150 2.3 1 0 1 0 0 ... 0 1 0 0 0 0 0 1 0 0
1 37 130 250 187 3.5 1 0 1 0 0 ... 0 1 0 0 0 0 0 0 1 0
2 41 130 204 172 1.4 1 1 0 0 1 ... 1 1 0 0 0 0 0 0 1 0
3 56 120 236 178 0.8 1 0 1 0 1 ... 1 1 0 0 0 0 0 0 1 0
4 57 120 354 163 0.6 1 1 0 1 0 ... 1 1 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
298 57 140 241 123 0.2 0 1 0 1 0 ... 0 1 0 0 0 0 0 0 0 1
299 45 110 264 132 1.2 0 0 1 0 0 ... 0 1 0 0 0 0 0 0 0 1
300 68 144 193 141 3.4 0 0 1 1 0 ... 0 0 0 1 0 0 0 0 0 1
301 57 130 131 115 1.2 0 0 1 1 0 ... 0 0 1 0 0 0 0 0 0 1
302 57 130 236 174 0.0 0 1 0 0 1 ... 0 0 1 0 0 0 0 0 1 0

303 rows × 31 columns

Before creating a model, I will split the data to train and test sets. I will use 80% of the data for training and 20% for testing.

Train test split

I will use train_test_split from sklearn.model_selection to split the data to train and test sets. I will use 80% of the data for training and 20% for testing. I will use random_state=42 to make the results reproducible.
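A minimal sketch of the split, using stand-in arrays with the same shape as the encoded data (303 rows, 30 feature columns after dropping `output`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays shaped like the encoded heart data.
rng = np.random.default_rng(42)
X = rng.random((303, 30))
y = rng.integers(0, 2, size=303)

# 80/20 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # -> (242, 30) (61, 30)
```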

Training models: XGBoost

XGBoost is a gradient boosting framework that uses tree-based learning. It is an optimized, distributed gradient boosting library designed to be highly efficient, flexible, and portable. It provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately; the same code runs on major distributed environments (Hadoop, SGE, MPI) and can scale beyond billions of examples.

recall precision f1 accuracy auc
GBM 0.8125 0.896552 0.852459 0.852459 0.918103

As we can see from the table above, the XGBoost model achieved very good performance.

The model achieved a recall of ~0.81, meaning that out of 100 patients who actually have heart disease, about 81 are predicted to have heart disease by the model.

The model achieved a precision of ~0.90, meaning that out of 100 patients predicted to have heart disease, about 90 actually have heart disease.

The model achieved an accuracy of ~0.85, meaning that out of 100 patients, about 85 are classified correctly by the model.

The F1 score of the model is ~0.85. The F1 score is the harmonic mean of precision and recall, so the model balances the two well.

The AUC of the model is ~0.92. This means that the model is good at distinguishing between patients with heart disease and patients without heart disease.

Model prediction & explanation on chosen patients

I will fit a new model on the whole dataset and use it to predict heart disease for 2 selected patients. I will also explain the model's predictions using SHAP values.

The first patient, number 56, is a male aged 48, with chest pain type 0 (typical angina), resting blood pressure of 122 mm Hg, cholesterol of 222 mg/dl, fasting blood sugar not above 120 mg/dl (fbs = 0), normal resting electrocardiographic results (restecg = 0), maximum heart rate achieved of 186, no exercise-induced angina (exng = 0), ST depression induced by exercise relative to rest of 0.0, peak exercise ST segment slope of 2, 0 major vessels, and thallium stress test result of 2 (normal).

The model predicts a ~0.99 probability that the patient has heart disease. The prediction is correct, as the patient actually has heart disease.

age sex cp trtbps chol fbs restecg thalachh exng oldpeak slp caa thall output
56 48 1 0 122 222 0 0 186 0 0.0 2 0 2 1
Probability of Heart disease: 0.9939232468605042

The second patient, number 167, is a female aged 62, with chest pain type 0 (typical angina), resting blood pressure of 140 mm Hg, cholesterol of 268 mg/dl, fasting blood sugar not above 120 mg/dl (fbs = 0), normal resting electrocardiographic results (restecg = 0), maximum heart rate achieved of 160, no exercise-induced angina (exng = 0), ST depression induced by exercise relative to rest of 3.6, peak exercise ST segment slope of 0, 2 major vessels, and thallium stress test result of 2 (normal).

The model predicts a ~0.02 probability that the patient has heart disease. The prediction is correct, as the patient does not have heart disease.

age sex cp trtbps chol fbs restecg thalachh exng oldpeak slp caa thall output
167 62 0 0 140 268 0 0 160 0 3.6 0 2 2 0
Probability of Heart disease: 0.019955100491642952

Shapley values using package Dalex

Interpretation of SHAP values from package Dalex

Patient 56

Based on the following plot, we can interpret the following:

1) The average model response is 0.545.

2) The most important features for the model's prediction for patient 56 are:

1) caa_0 = 1 - number of major vessels equal to 0 - it increased the predicted probability of heart disease by ~0.241 (from ~0.545 to ~0.786).

2) thall_2 = 1 - thallium stress test result of 2 (normal) - it increased the probability by ~0.101 (from ~0.545 to ~0.646).

3) cp_0 = 1 - chest pain type 0 (typical angina) - it decreased the probability by ~0.058 (from ~0.545 to ~0.487).

4) oldpeak = 0 - no ST depression induced by exercise relative to rest - it increased the probability by ~0.037 (from ~0.545 to ~0.582).

5) chol = 222 - cholesterol of 222 mg/dl - it increased the probability by ~0.073 (from ~0.545 to ~0.618).

6) thall_3 = 0 - no thallium stress test result of 3 (reversible defect) - it increased the probability by ~0.044 (from ~0.545 to ~0.589).

7) thalachh = 186 - maximum heart rate achieved of 186 - it increased the probability by ~0.015 (from ~0.545 to ~0.560).

8) sex_0 = 0 - being male - it decreased the probability by ~0.03 (from ~0.545 to ~0.515).

9) exng_0 = 1 - no exercise-induced angina - it increased the probability by ~0.013 (from ~0.545 to ~0.558).

Patient 167

Based on the following plot, we can interpret the following:

1) The average model response is 0.545.

2) The most important features for the model's prediction for patient 167 are:

1) caa_0 = 0 - one or more major vessels - it decreased the predicted probability of heart disease by ~0.269 (from ~0.545 to ~0.276).

2) cp_0 = 1 - chest pain type 0 (typical angina) - it decreased the probability by ~0.129 (from ~0.545 to ~0.416).

3) oldpeak = 3.6 - ST depression induced by exercise relative to rest of 3.6 - it decreased the probability by ~0.088 (from ~0.545 to ~0.457).

4) age = 62 - age of 62 years - it decreased the probability by ~0.151 (from ~0.545 to ~0.394).

5) sex_0 = 1 - being female - it increased the probability by ~0.072 (from ~0.545 to ~0.617).

6) slp_1 = 0 - peak exercise ST segment slope not flat - it increased the probability by ~0.023 (from ~0.545 to ~0.568).

7) exng_0 = 1 - no exercise-induced angina - it increased the probability by ~0.018 (from ~0.545 to ~0.563).

Conclusion

The effect of each variable on the probability of having heart disease differs from patient to patient, because SHAP values are local. However, for both patients the most important feature in the model's prediction is the number of major vessels (caa).

SHAP values using package shap

Now, I will use the package shap to calculate the Shapley values for the same 2 patients.

Interpretation of SHAP values from package shap & comparison with Dalex

Overall, the results from both packages are similar in terms of the direction of the marginal effect of each feature. However, the magnitudes of the effects differ, because dalex and shap use different algorithms to estimate the SHAP values. The TreeExplainer algorithm in the shap package is deterministic and exact for tree models, while the algorithm in dalex is stochastic: it estimates the values by averaging over randomly sampled feature orderings, so the results are approximate and vary between runs, and computing many orderings is slower.

Patient 56

Based on the following plot, we can interpret the following:

1) The average model response is 0.559.

2) The most important features for the model's prediction for patient 56 are:

1) caa_0 = 1 - number of major vessels equal to 0 - it increased the predicted probability of heart disease by ~0.19 (from ~0.559 to ~0.749).

2) etc.

Locality of SHAP values

To further show the locality of SHAP values, I will find two observations in the dataset that have different variables of highest importance.

Based on the code, patients 1 and 5 have different variables of highest importance. For patient 1, the most important feature is cp_0 = 0 - a chest pain type other than 0 (typical angina) - while for patient 5 it is oldpeak = 0.4 - ST depression induced by exercise relative to rest of 0.4.

First patient based on code: 1
Second patient based on code: 5

Let's further investigate the locality of SHAP values

I will try to select one variable X and find two observations in the dataset such that for one observation, X has a positive attribution, and for the other observation, X has a negative attribution.
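Such a pair of observations can be found mechanically. A minimal sketch, assuming a SHAP value matrix is already available; the matrix and feature names below are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical SHAP value matrix (rows = patients, columns = features);
# in the notebook this would come from explainer.shap_values(X).
feature_names = ["age", "chol", "oldpeak"]
sv = np.array([[ 0.12, -0.03,  0.05],
               [-0.08,  0.02, -0.01],
               [ 0.03,  0.07, -0.04]])

col = feature_names.index("age")
pos_idx = int(np.where(sv[:, col] > 0)[0][0])  # first patient with a positive attribution
neg_idx = int(np.where(sv[:, col] < 0)[0][0])  # first patient with a negative attribution
print(pos_idx, neg_idx)  # -> 0 1
```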

As variable X I will choose age. The first observation will be patient 1, for whom the feature age has a positive attribution.

Second selected patient based on code: 14

As we can see from the plot below, the feature age has a positive attribution for patient 1, while it has a negative attribution for patient 14.

Logistic regression

I will train another model (logistic regression) and find an observation for which the SHAP attributions differ between this model and the XGBoost model.

As we can see from the plots below, the two models (XGBoost and logistic regression) have different attributions for the same observation (patient 56). For instance, the variable thalachh has an attribution of ~0.1 for the logistic regression model, while its attribution for the XGBoost model is ~0.015, a difference of ~0.085. In general, however, the attributions are similar for both models.

Task B

We will now look closely at how to calculate Shapley values by hand. Consider a game with 3 players and the following payoffs:

v() = 0

v(A) = 20

v(B) = 20

v(C) = 60

v(A,B) = 60

v(A,C) = 70

v(B,C) = 70

v(A,B,C) = 100

Our goal is to calculate the Shapley value of player A.
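The Shapley value of a player is the average of their marginal contributions over all 3! = 6 orderings of the players. A short script to compute it from the payoffs above:

```python
from itertools import permutations

# Characteristic function v from the payoffs above.
v = {frozenset(): 0,
     frozenset("A"): 20, frozenset("B"): 20, frozenset("C"): 60,
     frozenset("AB"): 60, frozenset("AC"): 70, frozenset("BC"): 70,
     frozenset("ABC"): 100}

def shapley(player, players=("A", "B", "C")):
    """Average marginal contribution of `player` over all orderings."""
    orders = list(permutations(players))
    total = 0
    for order in orders:
        # Coalition of players that arrive before `player` in this ordering.
        before = frozenset(order[:order.index(player)])
        total += v[before | {player}] - v[before]
    return total / len(orders)

print(shapley("A"))  # -> 25.0
```

By the symmetry of the payoffs, player B also receives 25, and efficiency then gives player C 100 - 25 - 25 = 50.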

Summary

In this notebook we have seen how to use the dalex and shap packages to explain the predictions of a model. We compared the results from both packages and found that they are similar in terms of the direction of the marginal effect of each feature, while the magnitudes of the effects differ slightly.

We also successfully found observations for which the most important SHAP attribution differs. Furthermore, we found two observations such that the attribution of the same variable has a different sign. The lack of consistency shown in these examples does not necessarily mean that the method is unreliable; it is a reminder that we should be careful when interpreting the results and take the interactions between variables into consideration.

We also compared two models (XGBoost and logistic regression) and found that they have different attributions for the same observation (patient 56), although in general the attributions are similar.